Several fixes for writing large Arrow tables #57

Merged 4 commits on Nov 4, 2020

quinnj commented Nov 3, 2020

No description provided.

Fixes #56. The underlying flatbuffer array type stored the byte position
of the flatbuffer array as a `UInt32`, which overflowed in very large
files. Because this `Array` field is a Julia-side controlled type, we can
simply switch it to an `Int64` and avoid the issue altogether.
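
To make the overflow concrete, here is a minimal sketch of what happens to a byte position just past the 4 GiB mark when it is held in 32 bits. The offset value is made up for illustration and none of these names come from the package. The sketch does not claim which of the two failure modes (silent wrap-around or a conversion error) the writer actually hit.

```julia
# Hypothetical byte offset just past the 4 GiB mark (not taken from the PR).
pos = Int64(typemax(UInt32)) + 10

# A truncating conversion keeps only the low 32 bits, so the offset wraps.
wrapped = pos % UInt32

# A checked conversion refuses the value instead of wrapping.
checked_fails = try
    UInt32(pos)
    false
catch err
    err isa InexactError
end

# An Int64 position represents the same offset exactly.
exact = Int64(pos)
```

With a 64-bit position there is no practical file size at which the offset stops being representable.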

The bitpack encoding algorithm was allocating, and the resulting memory
churn slowed down writing of very large tables considerably. Rewriting it
to avoid allocations leads to drastically fewer allocations and much
faster writing. For non-optimized array writing, we also now write to a
buffer first to avoid hitting the global IO lock too often, which can
also hurt performance on large files.
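
The two ideas in this commit can be sketched as follows. This is not the code from the PR: `bitpack!` and `write_buffered` are hypothetical names, and the sketch assumes the usual Arrow bitmap layout (least-significant bit first). The first function packs booleans into a buffer the caller already owns instead of building new arrays; the second collects element writes in an in-memory buffer and hands the destination stream a single byte vector.

```julia
# Pack `bools` into the caller-supplied `bytes`, least-significant bit first.
# Returns the number of bytes used. Hypothetical helper, not the package's API.
function bitpack!(bytes::Vector{UInt8}, bools::AbstractVector{Bool})
    nbytes = cld(length(bools), 8)
    length(bytes) >= nbytes || throw(ArgumentError("output buffer too small"))
    # Clear only the bytes that will be used.
    for j in 1:nbytes
        bytes[j] = 0x00
    end
    for (i, b) in enumerate(bools)
        if b
            byteidx = ((i - 1) >> 3) + 1   # which byte holds element i
            bitidx = (i - 1) & 7           # which bit inside that byte
            bytes[byteidx] |= 0x01 << bitidx
        end
    end
    return nbytes
end

# Write every element into an in-memory buffer first, then issue one write
# on the destination stream. Hypothetical helper, not the package's API.
function write_buffered(io::IO, values::AbstractVector)
    buf = IOBuffer()
    for v in values
        write(buf, v)
    end
    return write(io, take!(buf))
end

bools = [true, false, true, true, false, false, false, false, true]
bytes = zeros(UInt8, 2)
nused = bitpack!(bytes, bools)

out = IOBuffer()
nwritten = write_buffered(out, UInt16[1, 2, 3])
```

The output buffer for `bitpack!` can be reused across columns, so the per-column cost is the packing loop alone. `write_buffered` still allocates its temporary buffer; the saving it illustrates is fewer calls on the destination stream, not zero allocation.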

A few minor cleanups to ensure that dictionary encoding types are
consistent, and that variable names are used correctly when writing both
first-time dictionary encodings and deltas.

codecov bot commented Nov 3, 2020

Codecov Report

Merging #57 into master will decrease coverage by 0.06%.
The diff coverage is 93.61%.


@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
- Coverage   83.34%   83.27%   -0.07%     
==========================================
  Files          23       23              
  Lines        2642     2649       +7     
==========================================
+ Hits         2202     2206       +4     
- Misses        440      443       +3     

| Impacted Files | Coverage Δ |
|---|---|
| src/FlatBuffers/table.jl | 53.84% <ø> (ø) |
| src/eltypes.jl | 82.90% <33.33%> (-0.72%) ⬇️ |
| src/arraytypes/dictencoding.jl | 80.00% <90.90%> (+0.15%) ⬆️ |
| src/arraytypes/arraytypes.jl | 88.09% <100.00%> (+0.14%) ⬆️ |
| src/arraytypes/bool.jl | 85.24% <100.00%> (ø) |
| src/table.jl | 95.60% <100.00%> (ø) |
| src/utils.jl | 82.79% <100.00%> (-0.54%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8583da8...6b3f9b9.

quinnj merged commit e07c7cc into master Nov 4, 2020
quinnj deleted the jq/bigtables branch November 4, 2020 00:23